---
title: "Predicting Student Dropout and Academic Success"
authors:
- "Patricia Götz"
- "Lana Kabbani"
- "Noémie Glaus"
- "Estela Gonzalez Vizcarra"
institute: University of Lausanne
date: today
title-block-banner: "#0095C8"
bibliography: reference.bib
csl: https://raw.githubusercontent.com/citation-style-language/styles/master/apa.csl
format:
  html:
    theme: cosmo
    toc: true
    toc-depth: 4
    code-fold: true
    code-tools: true
    df-print: paged
    self-contained: true
  pdf:
    toc: false
    echo: false
    include-in-header:
      text: |
        \usepackage{fvextra}
        \DefineVerbatimEnvironment{Highlighting}{Verbatim}{
          commandchars=\\\{\},
          breaklines, breaknonspaceingroup, breakanywhere
        }
execute:
  warning: false
  message: false
---
## 1. - Introduction
Student retention and academic success are crucial challenges for higher education institutions worldwide. Recent international observations show rising university dropout trends across multiple regions, including Australia and the United States [@sokolova2024dropout]. Looking closer at Europe, recent data from the German Center for Higher Education Research and Science Studies (2022) show that almost 30% of bachelor’s students in Germany leave university without graduating [@hachmeister2024german]. In Portugal, which is the focus of our analysis, recent data from Statistics Portugal reveal that a considerable portion of young adults aged 15 to 34 (16.8%) have dropped out of at least one level of education during their academic path [@europedata2024portugal]. Moreover, among those who dropped out, more than half (50.8%) did not complete their tertiary studies, highlighting that higher education represents a critical point of disengagement [@europedata2024portugal].
These figures underline the seriousness of dropouts in higher education and the reinforced need for universities to rely on data-driven insights to identify at-risk students and to design early intervention strategies.
We chose this topic because predicting student dropout not only helps optimize institutional resources but also supports students in achieving their academic goals. Understanding the factors that influence academic success, such as socio-economic background, previous academic performance, or family situation, can improve educational policies and personalized support systems. This subject is particularly meaningful in data science, as it allows us to combine analytical and predictive methods to better understand and prevent student dropout.
## 1.1 - Project Goals
The main objective of this project is to identify the factors that influence students to drop out, stay enrolled, or graduate from higher education. The dataset provides detailed information on each student’s academic performance, socioeconomic background, and demographic profile, offering a comprehensive view of the variables that shape educational outcomes. By the end of our analysis, we seek to identify the most significant combinations of academic and personal factors that influence student success.
First, our analysis will focus on academic performance, examining how variables such as admission grades, semester evaluations, and course results relate to final outcomes. For instance, we will analyze whether early academic performance can serve as a reliable predictor of future dropout risk. We will then explore the influence of socioeconomic and personal factors, including parental education, occupation, and financial situation, to understand their impact on academic achievement. Lastly, the dataset will be used to build and evaluate classification models that predict students’ academic status (Dropout, Enrolled, or Graduate).
In summary, this study combines exploratory analysis, visualization, and predictive modeling to generate actionable insights that help universities detect at-risk students early and strengthen academic success.
## 1.2 - Research Questions
- I. How do academic performance indicators and study conditions influence students’ likelihood of graduation or dropout?
- II. What is the impact of demographic and socioeconomic background on students’ probability of dropping out?
    a. To what extent do financial factors (debtor status, scholarship holder) affect student retention?
- III. Can we accurately predict a student’s final status (Dropout, Enrolled, or Graduate) based on their demographic, socioeconomic, and academic characteristics, and which of these characteristics are the most relevant?
    a. Which feature category, academic (grades, units), socioeconomic (debt, scholarship) or demographic (age, gender), contributes the most to predicting student dropout?
## 2. - Data
## 2.1 - Data Sourcing
The dataset is publicly available on the UCI Machine Learning Repository and was created from several databases of a higher education institution in Portugal. It covers students enrolled in different undergraduate programs and shows how demographic, socioeconomic and academic factors relate to dropout. Since the data has already been collected and can be downloaded directly from [UCI MLR - Predict Students' Dropout and Academic Success](https://archive.ics.uci.edu/dataset/697/predict%2Bstudents%2Bdropout%2Band%2Bacademic%2Bsuccess){target="_blank"} (accessed on 20th October), there is no need to collect more data via web scraping or APIs.
## 2.2 - Data Description
The dataset, containing data from a Portuguese higher education institution, is provided as a CSV file of approximately 520 KB and contains detailed information about students’ demographic, academic and socioeconomic characteristics. It includes 4424 student records and 37 variables (features). After reviewing the dataset variables, we removed two irrelevant ones, leaving 35 variables for analysis.
We didn't encounter any difficult challenges. The dataset was already clean and encoded, so we didn't need to perform variable merging, one-hot encoding or ordinal encoding. We only had to translate categorical variables into readable labels to facilitate our visualization analysis.
### 2.2.1 - Data Loading
```{python}
#| label: setup
# Import libraries
from ucimlrepo import fetch_ucirepo
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
# Set style
sns.set_style("whitegrid")
plt.rcParams['figure.dpi'] = 100
# Load data
dataset = fetch_ucirepo(id=697)
X = np.array(dataset.data.features)
y = np.array(dataset.data.targets)
# Create dataframe
col_names = dataset.variables["name"]
df = pd.DataFrame(np.column_stack((X, y)), columns=col_names)
print(f"Dataset shape: {df.shape}")
```
---
### 2.2.2 - Variable Selection
We selected 35 relevant variables for analysis:
```{python}
#| label: variable-selection
selected_columns = [
"Marital Status",
"Application order",
"Course",
"Daytime/evening attendance",
"Previous qualification",
"Previous qualification (grade)",
"Nacionality",
"Mother's qualification",
"Father's qualification",
"Mother's occupation",
"Father's occupation",
"Admission grade",
"Educational special needs",
"Gender",
"Scholarship holder",
"Age at enrollment",
"Displaced",
"Debtor",
"International",
"Curricular units 1st sem (credited)",
"Curricular units 1st sem (enrolled)",
"Curricular units 1st sem (evaluations)",
"Curricular units 1st sem (approved)",
"Curricular units 1st sem (grade)",
"Curricular units 1st sem (without evaluations)",
"Curricular units 2nd sem (credited)",
"Curricular units 2nd sem (enrolled)",
"Curricular units 2nd sem (evaluations)",
"Curricular units 2nd sem (approved)",
"Curricular units 2nd sem (grade)",
"Curricular units 2nd sem (without evaluations)",
"Unemployment rate",
"Inflation rate",
"GDP",
"Target",
]
df = df[selected_columns].copy()
print(f"Selected {len(selected_columns)} variables")
```
### 2.2.3 - Selected Variable Descriptions
```{python}
#| label: variable-descriptions
#| echo: false
# Create variable information table
variable_info = pd.DataFrame({
'Variable': [
'Marital Status',
'Application order',
'Course',
'Daytime/evening attendance',
'Previous qualification',
'Previous qualification (grade)',
'Nacionality',
"Mother's qualification",
"Father's qualification",
"Mother's occupation",
"Father's occupation",
'Admission grade',
'Educational special needs',
'Gender',
'Scholarship holder',
'Age at enrollment',
'Displaced',
'Debtor',
'International',
'Curricular units 1st sem (credited)',
'Curricular units 1st sem (enrolled)',
'Curricular units 1st sem (evaluations)',
'Curricular units 1st sem (approved)',
'Curricular units 1st sem (grade)',
'Curricular units 1st sem (without evaluations)',
'Curricular units 2nd sem (credited)',
'Curricular units 2nd sem (enrolled)',
'Curricular units 2nd sem (evaluations)',
'Curricular units 2nd sem (approved)',
'Curricular units 2nd sem (grade)',
'Curricular units 2nd sem (without evaluations)',
'Unemployment rate',
'Inflation rate',
'GDP',
'Target'
],
'Description': [
'Student marital status',
'Application preference order',
'Course taken by student',
'Attendance type (daytime or evening)',
'Type of previous qualification',
'Grade of previous qualification',
'Student nationality',
'Educational qualification of mother',
'Educational qualification of father',
'Occupation of mother',
'Occupation of father',
'Admission grade to the program',
'Whether student has special educational needs',
'Student gender',
'Whether student is scholarship holder',
'Age of student at enrollment',
'Whether student is displaced from home',
'Whether student is a debtor',
'Whether student is international',
'Credited units in 1st semester',
'Enrolled units in 1st semester',
'Number of evaluations in 1st semester',
'Approved units in 1st semester',
'Average grade in 1st semester',
'Units without evaluations in 1st semester',
'Credited units in 2nd semester',
'Enrolled units in 2nd semester',
'Number of evaluations in 2nd semester',
'Approved units in 2nd semester',
'Average grade in 2nd semester',
'Units without evaluations in 2nd semester',
'Unemployment rate at time of enrollment',
'Inflation rate at time of enrollment',
'GDP at time of enrollment',
'Student status (Dropout, Enrolled, or Graduate)'
],
'Type': [
'Categorical',
'Categorical',
'Categorical',
'Categorical',
'Categorical',
'Numerical (Continuous)',
'Categorical',
'Categorical',
'Categorical',
'Categorical',
'Categorical',
'Numerical (Continuous)',
'Binary',
'Binary',
'Binary',
'Numerical (Discrete)',
'Binary',
'Binary',
'Binary',
'Numerical (Discrete)',
'Numerical (Discrete)',
'Numerical (Discrete)',
'Numerical (Discrete)',
'Numerical (Continuous)',
'Numerical (Discrete)',
'Numerical (Discrete)',
'Numerical (Discrete)',
'Numerical (Discrete)',
'Numerical (Discrete)',
'Numerical (Continuous)',
'Numerical (Discrete)',
'Numerical (Continuous)',
'Numerical (Continuous)',
'Numerical (Continuous)',
'Categorical'
]
})
# Display table
from IPython.display import Markdown, display
# Create markdown table
table_md = variable_info.to_markdown(index=False)
display(Markdown(table_md))
```
---
## 3. - Preprocessing (Data Cleaning and Wrangling)
One of the most important steps in our project is data cleaning and wrangling. After running the code to check for missing values and undefined numerical data, we found that the dataset contains no missing values and no data entry mistakes.
The dataset was already encoded. We removed the “Application mode” and “Tuition fees up to date” variables because they are not relevant to our research questions.
After ensuring that the numeric columns are stored as numeric types, we translated categorical variables such as “Gender”, “Debtor”, “Displaced”, and “Daytime/evening attendance” into readable string labels for analysis.
Although the dataset was well structured and clean, our main challenge was to establish its reliability. We checked for missing values and data entry mistakes, and identified the variables that are irrelevant to our analysis. We then continued the cleaning work by converting the categorical variables. After these steps, the dataset was ready to be analyzed.
```{python}
#| label: data-cleaning
def clean_dataframe(df, col_missing_thresh=0.30, row_missing_thresh=0.50):
    """Report missing values, drop sparse columns/rows, coerce types, and impute."""
    df = df.copy()

    # Count NaNs per column and report the columns that contain any
    missing = df.isna().sum()
    missing_data = missing[missing > 0]
    if len(missing_data) > 0:
        print(f"\n⚠️ Missing values found in {len(missing_data)} columns ({missing_data.sum():,} total)\n")
        display(missing_data.to_frame('Count'))
    else:
        print("\n✓ No missing values found!")

    # Drop columns with an excessive share of missing values
    col_frac = df.isna().mean()
    drop_cols = col_frac[col_frac > col_missing_thresh].index.tolist()
    if drop_cols:
        df = df.drop(columns=drop_cols)

    # Drop rows with an excessive share of missing values
    row_frac = df.isna().mean(axis=1)
    drop_rows = row_frac[row_frac > row_missing_thresh].index
    if len(drop_rows):
        df = df.drop(index=drop_rows).reset_index(drop=True)

    # Coerce columns to numeric where possible; leave the others unchanged
    def to_numeric_if_possible(s):
        try:
            return pd.to_numeric(s)
        except (ValueError, TypeError):
            return s

    df = df.apply(to_numeric_if_possible)

    # Impute remaining missing values: median for numeric, mode for categorical
    for col in df.select_dtypes(include=[np.number]).columns:
        if df[col].isna().any():
            df[col] = df[col].fillna(df[col].median())
    for col in df.select_dtypes(include=["category", "object"]).columns:
        if df[col].isna().any():
            mode = df[col].mode(dropna=True)
            if not mode.empty:
                df[col] = df[col].fillna(mode.iloc[0])
    return df


df = clean_dataframe(df)
print(f"Shape after cleaning: {df.shape}")
print(f"Missing values: {df.isna().sum().sum()}")
```
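The translation of encoded categorical variables into readable labels can be sketched as follows. This is a minimal illustration on a toy data frame, not the exact code used in the project: the helper `add_readable_labels` and the code-to-label dictionaries are assumptions made for this example and should be checked against the dataset documentation before use.

```python
import pandas as pd

# Assumed code-to-label mappings (hypothetical; verify against the dataset documentation)
label_maps = {
    "Gender": {0: "Female", 1: "Male"},
    "Debtor": {0: "No", 1: "Yes"},
    "Displaced": {0: "No", 1: "Yes"},
    "Daytime/evening attendance": {0: "Evening", 1: "Daytime"},
}


def add_readable_labels(frame, maps):
    """Add a '<column> (label)' column for every mapped column present in the frame."""
    out = frame.copy()
    for col, mapping in maps.items():
        if col in out.columns:
            out[f"{col} (label)"] = out[col].map(mapping)
    return out


# Toy rows for illustration only, not taken from the real dataset
toy = pd.DataFrame({"Gender": [1, 0, 1], "Debtor": [0, 0, 1]})
labelled = add_readable_labels(toy, label_maps)
print(labelled)
```

Writing the labels to new columns keeps the original numeric codes available for steps that need numbers, such as the correlation matrix.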
---
## 4. - Exploratory Data Analysis (EDA)
In this section, we explore the dataset to understand the main characteristics of the variables and how they relate to student outcomes (Dropout, Enrolled, Graduate). The goal of the EDA is to identify patterns, detect anomalies, and determine which features are most informative for predicting dropout.
## 4.1 - Target Variable
We begin by examining the distribution of the target variable.
The three student outcomes (Dropout, Enrolled, and Graduate) are highly imbalanced, with Graduates representing the largest group, followed by Dropouts, and a smaller proportion of Enrolled students.
```{python}
#| label: target-distribution
# Recode target
target_col = "Target"
df[target_col] = df[target_col].replace({
0: "Dropout",
1: "Enrolled",
2: "Graduate"
})
df[target_col] = pd.Categorical(
df[target_col],
categories=["Dropout", "Enrolled", "Graduate"],
ordered=True
)
# Visualize
fig, ax = plt.subplots(figsize=(8, 5))
# Sort by category order so each bar matches its Dropout/Enrolled/Graduate color
target_counts = df[target_col].value_counts().sort_index()
colors = ['#e74c3c', '#f39c12', '#2ecc71']
bars = ax.bar(range(len(target_counts)), target_counts.values,
color=colors, edgecolor='black', linewidth=1.5)
ax.set_xticks(range(len(target_counts)))
ax.set_xticklabels(target_counts.index)
ax.set_ylabel('Number of Students', fontsize=11)
ax.set_title('Student Outcomes Distribution', fontsize=13, fontweight='bold')
ax.grid(axis='y', alpha=0.3)
# Add labels
for i, v in enumerate(target_counts.values):
    ax.text(i, v + 30, f'{v}\n({v/len(df)*100:.1f}%)',
            ha='center', fontweight='bold')
plt.tight_layout()
plt.show()
```
---
## 4.2 - Correlation Analysis
```{python}
#| label: correlation-matrix
#| fig-height: 12
#| fig-width: 14
# Calculate correlations
corr = df.corr(numeric_only=True)
# Create heatmap
plt.figure(figsize=(16, 14))
mask = np.triu(np.ones_like(corr, dtype=bool), k=1)
sns.heatmap(
corr,
mask=mask,
cmap='RdBu_r',
center=0,
vmin=-1,
vmax=1,
annot=True,
fmt='.2f',
square=True,
linewidths=0.5,
cbar_kws={"shrink": 0.8, "label": "Correlation"}
)
plt.title('Correlation Matrix of Numeric Variables',
fontsize=16, fontweight='bold', pad=20)
plt.tight_layout()
plt.show()
```
Based on our correlation analysis, we identified several moderately and highly correlated variable pairs that indicate multicollinearity.
The variables “International” and “Nacionality” are highly correlated. We therefore chose to remove “International”, since it is less informative than “Nacionality”.
The variables “Father's occupation” and “Mother's occupation” are also highly correlated, but in this case the correlation reflects social structure: they describe two distinct individuals and two potentially different socioeconomic effects, so we keep both. The same applies to “Mother's qualification” and “Father's qualification”.
Although “Curricular units 1st sem (enrolled)” and “Curricular units 2nd sem (enrolled)”, as well as “Curricular units 1st sem (grade)” and “Curricular units 2nd sem (grade)”, are highly correlated, we keep them because they capture how performance progresses across the two semesters, which is relevant for predicting dropout. The remaining eight semester variables (credited, evaluations, approved, and without evaluations, for each semester) are redundant with these, so we excluded them together with “International”.
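As a complement to reading the heatmap by eye, highly correlated pairs can also be listed programmatically. The sketch below is illustrative only: the helper `high_corr_pairs`, the 0.8 threshold, and the toy columns `a`, `b`, `c` are assumptions for this example and are not part of the analysis above.

```python
import numpy as np
import pandas as pd


def high_corr_pairs(frame, threshold=0.8):
    """Return variable pairs whose absolute Pearson correlation exceeds the threshold."""
    corr = frame.corr(numeric_only=True).abs()
    # Keep only the strict upper triangle so each pair appears once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    return pairs[pairs > threshold].sort_values(ascending=False)


# Toy data: 'b' is almost a multiple of 'a', while 'c' is only weakly related to both
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2.1, 3.9, 6.0, 8.2, 9.8],
    "c": [5, 1, 4, 2, 3],
})
pairs = high_corr_pairs(toy)
print(pairs)
```

A list like this only flags candidates. Whether a flagged variable is actually dropped remains a judgment call, as with the parental occupation and qualification variables discussed above.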
---
## 4.3 - Feature Selection
```{python}
#| label: feature-selection
# Remove highly correlated features
columns_to_remove = [
"Curricular units 1st sem (credited)",
"Curricular units 1st sem (evaluations)",
"Curricular units 1st sem (approved)",
"Curricular units 1st sem (without evaluations)",
"Curricular units 2nd sem (credited)",
"Curricular units 2nd sem (evaluations)",
"Curricular units 2nd sem (approved)",
"Curricular units 2nd sem (without evaluations)",
"International",
]
df = df.drop(columns=columns_to_remove)
print(f"Removed {len(columns_to_remove)} highly correlated variables")
print(f"Remaining variables: {df.shape[1]}")
```
---
## 4.4 - Outlier Detection
We implemented a type-aware outlier detection strategy that applies different methods based on the nature of each variable:
**Binary variables** (e.g., Gender, Scholarship holder): Outlier detection was skipped entirely, as these variables only contain two valid values (0/1).
**Nominal categorical variables** (e.g., Course, Nationality): No outlier detection applied, as these represent distinct categories without natural ordering. We only reported the number of unique categories present.
**Ordinal categorical variables** (e.g., qualifications, occupations): We reported the number of levels but did not apply outlier detection, as these represent ordered categories rather than continuous measurements.
**Grade variables** (0-200 scale): We checked for values outside the valid range (0-200) and additionally flagged statistically extreme values with an absolute Z-score above 3. According to the dataset documentation, grades in the Portuguese system can range from 0 to 200.
**Count variables** (e.g., enrolled courses): We flagged values with an absolute Z-score above 3. Count variables naturally exhibit right-skewed distributions, so high values may represent legitimate cases (e.g., students enrolling in many courses), and flagged observations were reviewed rather than removed automatically.
**Continuous variables** (e.g., Age, GDP, Unemployment rate): We applied the same rule, flagging values that lie more than 3 standard deviations from the mean (absolute Z-score above 3).
This approach ensures that outlier detection is contextually appropriate for each variable type, reducing false positives while identifying genuine data quality issues.
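To make the statistical rules concrete, the sketch below contrasts two common ways of flagging extreme values on a toy series: a Z-score threshold of 3 and Tukey's 1.5×IQR fences. The function names and the toy ages are assumptions made for this illustration, not values from the dataset.

```python
import pandas as pd


def zscore_outliers(series, threshold=3.0):
    """Boolean mask of values whose absolute Z-score exceeds the threshold."""
    std = series.std()
    if not std > 0:
        return pd.Series(False, index=series.index)
    return ((series - series.mean()) / std).abs() > threshold


def tukey_outliers(series, k=1.5):
    """Boolean mask of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


# Toy ages for illustration only: mostly 18-23, plus one 27 and one 70
ages = pd.Series(
    [18, 19, 19, 20, 20, 20, 21, 21, 22, 23, 27, 19, 20, 21, 22, 18, 20, 21, 19, 70],
    name="Age at enrollment",
)
z_mask = zscore_outliers(ages)
t_mask = tukey_outliers(ages)
print("Z-score rule flags:", ages[z_mask].tolist())
print("Tukey rule flags:", ages[t_mask].tolist())
```

On this toy series the Z-score rule flags only the most extreme value, while the Tukey fences also flag the moderately high one. The choice of rule therefore changes how many observations are reported as potential outliers.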
```{python}
#| label: outlier-detection
def detect_outliers_intelligent(df, var_type_dict):
    """Detect outliers based on variable type using simple statistical rules."""
    results = []

    # Binary variables - skip
    print("\n Binary variables (skipping outlier detection):")
    for col in var_type_dict.get('binary', []):
        if col in df.columns:
            unique_vals = sorted(df[col].dropna().unique())
            print(f" - {col}: values = {unique_vals}")

    # Nominal categorical
    print("\n Nominal Categorical (no natural order):")
    for col in var_type_dict.get('nominal', []):
        if col not in df.columns:
            continue
        series = df[col].dropna()
        print(f" - {col}: {len(series.unique())} categories")

    # Ordinal categorical
    print("\n Ordinal Categorical (meaningful order):")
    for col in var_type_dict.get('ordinal', []):
        if col not in df.columns:
            continue
        series = df[col].dropna()
        print(f" - {col}: {len(series.unique())} levels")

    # Grade variables (0-200 scale + Z-score)
    print("\n Grade variables (0-200 range + Z-score > 3):")
    for col in var_type_dict.get('grades', []):
        if col not in df.columns:
            continue
        series = df[col].dropna()
        # Check range violations
        invalid = ((series < 0) | (series > 200)).sum()
        # Check statistical outliers using Z-score
        mean, std = series.mean(), series.std()
        if std > 0:
            z_scores = np.abs((series - mean) / std)
            statistical_outliers = (z_scores > 3).sum()
        else:
            statistical_outliers = 0
        total_outliers = invalid + statistical_outliers
        outlier_pct = 100 * total_outliers / len(series) if len(series) > 0 else 0
        print(f" - {col}: {invalid} out-of-range + {statistical_outliers} extreme (Z>3) = "
              f"{total_outliers} total ({outlier_pct:.1f}%)")
        if total_outliers > 0:
            results.append({
                'column': col, 'type': 'grade',
                'issue': 'out_of_range + extreme',
                'count': total_outliers, 'pct': outlier_pct
            })

    # Count variables (Z-score > 3)
    print("\n Count variables (Z-score > 3):")
    for col in var_type_dict.get('counts', []):
        if col not in df.columns:
            continue
        series = df[col].dropna()
        if len(series) == 0:
            continue
        mean, std = series.mean(), series.std()
        if std > 0:
            z_scores = np.abs((series - mean) / std)
            outliers = (z_scores > 3).sum()
        else:
            outliers = 0
        outlier_pct = 100 * outliers / len(series)
        print(f" - {col}: extreme values: {outliers} ({outlier_pct:.1f}%)")
        if outliers > 0:
            results.append({
                'column': col, 'type': 'count',
                'issue': 'extreme_outlier',
                'count': outliers, 'pct': outlier_pct
            })

    # Continuous variables (Z-score > 3)
    print("\n Continuous variables (Z-score > 3):")
    for col in var_type_dict.get('continuous', []):
        if col not in df.columns:
            continue
        series = df[col].dropna()
        if len(series) == 0:
            continue
        mean, std = series.mean(), series.std()
        if std > 0:
            z_scores = np.abs((series - mean) / std)
            outliers = (z_scores > 3).sum()
        else:
            outliers = 0
        outlier_pct = 100 * outliers / len(series)
        print(f" - {col}: extreme values: {outliers} ({outlier_pct:.1f}%)")
        if outliers > 0:
            results.append({
                'column': col, 'type': 'continuous',
                'issue': 'extreme_outlier',
                'count': outliers, 'pct': outlier_pct
            })

    return pd.DataFrame(results)

# Define variable types
var_types = {
'binary': [
"Daytime/evening attendance", "Educational special needs",
"Gender", "Scholarship holder", "Displaced", "Debtor", "International"
],
'nominal': ["Course", "Nacionality"],
'ordinal': [
"Marital Status", "Application mode", "Application order",
"Previous qualification", "Mother's qualification",
"Father's qualification", "Mother's occupation", "Father's occupation"
],
'grades': [
"Previous qualification (grade)", "Admission grade",
"Curricular units 1st sem (grade)", "Curricular units 2nd sem (grade)"
],
'counts': [
"Curricular units 1st sem (enrolled)",
"Curricular units 2nd sem (enrolled)"
],
'continuous': ["Age at enrollment", "Unemployment rate", "Inflation rate", "GDP"]
}
# Run outlier detection
outlier_results = detect_outliers_intelligent(df, var_types)
```
### 4.5 - Outlier Summary
```{python}
#| label: outlier-summary
if not outlier_results.empty:
    outlier_results = outlier_results.sort_values('pct', ascending=False)
    print("\n Detected Issues:")
    # A bare expression inside an if-block is not rendered, so display it explicitly
    display(outlier_results)
    # Visualize problematic variables
    for _, row in outlier_results.iterrows():
        col = row['column']
        fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
        # Histogram
        df[col].hist(bins=30, ax=ax1, edgecolor='black')
        ax1.set_title("Distribution")
        ax1.set_xlabel(col)
        ax1.set_ylabel("Frequency")
        ax1.grid(alpha=0.3)
        # Boxplot
        sns.boxplot(y=df[col], ax=ax2)
        ax2.set_title(f"Boxplot ({row['type']})")
        ax2.grid(alpha=0.3, axis='y')
        plt.suptitle(f"{col}: {row['count']} potential outliers ({row['pct']:.1f}%)",
                     fontsize=12, fontweight='bold')
        plt.tight_layout()
        plt.show()
else:
    print("\n✅ No significant outliers detected!")
```
We identified outliers in five different variables.
The first two are Curricular units 1st semester (enrolled) and Curricular units 2nd semester (enrolled), which represent the number of courses students register for each semester. We observed 106 potential outliers in the first semester and 82 in the second semester. Since the average course load is usually 5 to 6 classes, students taking a much higher or lower number of courses are naturally flagged as outliers. In the first semester, the highest value reaches 26 classes. Although this is an ambitious workload, it remains possible. Several situations could explain such a high number: for instance, a student trying to complete their degree quickly, or a student retaking courses after previous failures. These cases can reflect meaningful academic behaviours, so removing them would risk losing useful information. For the second semester, the maximum value is around 20 classes, leading to similar conclusions. In both semesters, we also observe students enrolled in zero courses, which appears as an extreme value as well. This may correspond to students who completed most of their required courses earlier, or students taking a temporary break while still being officially enrolled. These profiles are still relevant and should be included. In this context, these extreme values are not problematic. On the contrary, they may help us understand whether taking unusually many, or unusually few, courses has an impact on college dropout. For this reason, we decided not to remove or cap these observations.
The next variable with detected outliers is Age at enrollment, for which 101 potential outliers were identified. Since the average age at enrollment is around 20 years old, students beginning their studies at 40 or 50 naturally appear as unusual cases. The oldest student is 70 years old, which is uncommon but not problematic for our analysis. Starting university at 70 does not change the fact that this person is a student, and these cases should be included. These values represent real and meaningful student profiles, such as mature students or individuals returning to education after a long break. Excluding them would remove important diversity from the dataset and limit our understanding of the different types of students who may or may not drop out. For this reason, we chose not to remove or cap the age-related outliers.
Finally, outliers were also detected in Admission grade (22 cases) and Previous qualification grade (21 cases). These extreme values reflect either exceptionally high academic performance or, conversely, unusually low grades. Since these cases may provide insights into how prior academic achievement relates to dropout behavior, removing or capping them would not be appropriate. We therefore opted to retain all outliers in these grade variables as well.
Based on our research questions, we conclude that removing these outliers would not benefit our analysis, as they do not represent errors but rather uncommon yet meaningful observations. Retaining them allows us to capture the full diversity of student profiles and provides a more accurate understanding of the factors that may influence college dropout.
## 4.6 - Feature Importance Analysis
### 4.6.1 - Methodology
We used one-way ANOVA (Analysis of Variance) to identify which numeric variables show significant differences across the three target groups (Dropout, Enrolled, Graduate). For each variable, we calculated:
- **p-value**: Statistical significance of differences between groups (α = 0.05)
- **Eta-squared (η²)**: Effect size measure representing the proportion of variance explained by the target variable (ranges from 0 to 1, where higher values indicate stronger association)
Variables with p-value < 0.05 are considered significantly associated with student outcomes and may be strong predictors in classification models.
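The two quantities above can be illustrated on a small hand-checkable example. The sketch below computes the F statistic and η² = SS_between / SS_total in plain Python. The function name `one_way_anova` and the toy grades are assumptions for this illustration, and the p-value is left out because it requires the F distribution (the analysis itself relies on `scipy.stats.f_oneway` for that).

```python
def one_way_anova(groups):
    """Return (F statistic, eta-squared) for a list of numeric samples."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    # Variation of the group means around the grand mean, weighted by group size
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Total variation of all observations around the grand mean
    ss_total = sum((v - grand_mean) ** 2 for v in all_values)
    ss_within = ss_total - ss_between
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    eta_sq = ss_between / ss_total
    return f_stat, eta_sq


# Toy grades for illustration only (group means 10, 12 and 14)
dropout = [10, 11, 9]
enrolled = [12, 13, 11]
graduate = [14, 15, 13]
f_stat, eta_sq = one_way_anova([dropout, enrolled, graduate])
print(f"F = {f_stat:.2f}, eta-squared = {eta_sq:.2f}")
```

Here SS_between is 24 and SS_total is 30, so 80% of the variance in the toy grades is explained by group membership.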
```{python}
#| label: anova-analysis
# ANOVA for numeric variables
from scipy.stats import f_oneway

anova_results = {}
numeric_cols = df.select_dtypes(include=np.number).columns
for col in numeric_cols:
    # One sample per outcome category, skipping categories without data
    groups = [df.loc[df[target_col] == cat, col].dropna()
              for cat in df[target_col].cat.categories]
    groups = [g for g in groups if len(g) > 0]
    # Need at least 2 non-empty groups
    if len(groups) < 2:
        continue
    f_val, p_val = f_oneway(*groups)
    # Effect size: eta-squared = SS_between / SS_total
    grand_mean = df[col].mean()
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    ss_total = ((df[col] - grand_mean) ** 2).sum()
    eta_sq = ss_between / ss_total if ss_total > 0 else np.nan
    anova_results[col] = {"p_value": p_val, "eta_sq": eta_sq}

# Create results dataframe
anova_df = (pd.DataFrame(anova_results).T
            .sort_values(["p_value", "eta_sq"], ascending=[True, False]))
anova_df["significant"] = anova_df["p_value"] < 0.05
print(f"Significant variables (p < 0.05): {anova_df['significant'].sum()}")
anova_df.head(15)
```
## 4.7 - Top Predictive Variables
### 4.7.1 - Academic Performance Indicators
Our exploratory analysis shows relationships between academic performance measures and student outcomes (Dropout, Enrolled, Graduate). Several patterns emerge across admission grades, semester performance, and course load.
```{python}
#| label: fig-admission-grade
#| fig-cap: "Admission grade by student outcome"
plt.figure(figsize=(8, 5))
sns.boxplot(
x=target_col,
y='Admission grade',
data=df,
palette=['#e74c3c', '#f39c12', '#2ecc71']
)
plt.title('Admission grade by Target', fontsize=12, fontweight='bold')
plt.grid(alpha=0.3, axis='y')
plt.tight_layout()
plt.show()
```
In Figure 1, we observe the distribution of the admission grade across the three categories (Dropout, Enrolled, and Graduate). Dropout students have a median admission grade of around 122, with several outliers reaching above 160. Enrolled students show a very similar median to Dropout students, but with fewer extreme values. Graduate students display a slightly higher median admission grade, around 125, and similarly present a few outliers above 160.
Overall, the three groups show comparable distributions, with considerable overlap in their admission grades. Graduate students tend to have a marginally higher median, which may suggest that stronger academic preparation is associated with a greater likelihood of graduating. However, the presence of high admission grades in both the Dropout and Graduate categories indicates that good grades alone do not fully determine academic outcomes. In other words, while admission grade may play a role, it is not a decisive predictor of whether a student will graduate or drop out.
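Values read off a boxplot can be cross-checked with a per-group numeric summary. The sketch below shows one way to compute the quartiles and interquartile range for each outcome. The toy grades are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy admission grades for illustration only
toy = pd.DataFrame({
    "Target": ["Dropout"] * 3 + ["Enrolled"] * 3 + ["Graduate"] * 3,
    "Admission grade": [110.0, 120.0, 160.0, 115.0, 122.0, 130.0, 118.0, 125.0, 150.0],
})

# Quartiles per outcome: the same quantities a boxplot draws as box edges and center line
summary = (
    toy.groupby("Target")["Admission grade"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
    .rename(columns={0.25: "q1", 0.5: "median", 0.75: "q3"})
)
summary["iqr"] = summary["q3"] - summary["q1"]
print(summary)
```

Applied to the real data frame, the same pattern would use `df` and `target_col` in place of the toy frame and the literal column name.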
```{python}
#| label: fig-prev-qual-grade
#| fig-cap: "Previous qualification grade by student outcome"
plt.figure(figsize=(8, 5))
sns.boxplot(
x=target_col,
y='Previous qualification (grade)',
data=df,
palette=['#e74c3c', '#f39c12', '#2ecc71']
)
plt.title('Previous qualification (grade) by Target', fontsize=12, fontweight='bold')
plt.grid(alpha=0.3, axis='y')
plt.tight_layout()
plt.show()
```
Figure 2 shows the distribution of the previous qualification grade across the three target groups (Dropout, Enrolled, and Graduate). All three boxplots display similar characteristics, with medians around 130-133. The minimum and maximum values are also comparable, ranging from about 100 to 165. The three groups have multiple outliers at both the lower and upper extremes of the grade distribution, with the Dropout and Graduate groups showing the most extreme values.
Because the distributions and medians are so similar, previous qualification grade does not appear to be a strong predictor of student outcomes. Interestingly, the Dropout group’s median is quite high, which indicates that students who drop out do not necessarily have lower prior grades than those who graduate or stay enrolled. The outliers show that each category contains both very high and very low grades, which suggests that factors beyond prior academic performance are at play.
```{python}
#| label: fig-st-sem-grade
#| fig-cap: "First semester grade by student outcome"
plt.figure(figsize=(8, 5))
sns.boxplot(
x=target_col,
y='Curricular units 1st sem (grade)',
data=df,
palette=['#e74c3c', '#f39c12', '#2ecc71']
)
plt.title('Curricular units 1st sem (grade) by Target', fontsize=12, fontweight='bold')
plt.grid(alpha=0.3, axis='y')
plt.tight_layout()
plt.show()
```
In Figure 3, we see the first-semester grades for the three outcome groups: Dropout, Enrolled, and Graduate. Dropout students show a wide range of grades. Enrolled students have grades around a median of 12.5, with moderate spread. Graduate students have the highest median, around 13.5, and a tighter distribution.
The wide spread among Dropout students suggests that leaving the program is not only due to low grades. Enrolled students show average performance, indicating steady progress but not full completion. Graduate students perform consistently better, suggesting that higher and more stable first-semester grades are associated with graduation.
```{python}
#| label: fig-boxplot-grades
#| fig-cap: "Second semester grade by student outcome"
plt.figure(figsize=(8, 5))
sns.boxplot(
x=target_col,
y='Curricular units 2nd sem (grade)',
data=df,
palette=['#e74c3c', '#f39c12', '#2ecc71']
)
plt.title(f"Curricular units 2nd sem (grade) by {target_col}", fontsize=12, fontweight='bold')
plt.grid(alpha=0.3, axis='y')
plt.tight_layout()
plt.show()
```
Figure 4 reveals distinct patterns in second-semester grades across the three groups. The Dropout category displays the widest range of performance. Enrolled students demonstrate moderate variability with a median grade near 12. Graduates show the tightest distribution and the highest median, at approximately 13. By the second semester, the gaps between groups widen. Many dropouts have a grade of 0 (the distribution starts at 0), which likely corresponds to students who completed few or no units and suggests that this is when they left the program. Graduates continued performing well, with consistent results around 13. Enrolled students fell somewhere in between, with decent but mixed performance. The second semester appears to be a turning point where struggling students drop out while successful students keep their momentum.
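The remark about grades of 0 can be checked directly by computing, for each outcome group, the share of students whose second-semester grade is exactly 0. The snippet below is a sketch on invented toy data; the helper name `zero_grade_share` is hypothetical, and the column names are assumed to match those used in the plots.

```python
import pandas as pd

# Toy data (invented values) with the same column names as the plotted variables
toy = pd.DataFrame({
    "Target": ["Dropout"] * 4 + ["Enrolled"] * 4 + ["Graduate"] * 4,
    "Curricular units 2nd sem (grade)": [0.0, 0.0, 10.5, 12.0,
                                          0.0, 11.0, 12.0, 13.0,
                                          12.5, 13.0, 13.5, 14.0],
})


def zero_grade_share(data: pd.DataFrame, grade_col: str,
                     group_col: str = "Target") -> pd.Series:
    """Fraction of students in each group whose grade equals 0."""
    is_zero = data[grade_col].eq(0)
    return is_zero.groupby(data[group_col]).mean()


zero_share = zero_grade_share(toy, "Curricular units 2nd sem (grade)")
print(zero_share)
```

A markedly higher share of zeros in the Dropout group on the real data would support the interpretation that many of these students had already disengaged by the second semester.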
```{python}
#| label: fig-1st-sem-enrolled
#| fig-cap: "First semester enrollment by student outcome"
plt.figure(figsize=(8, 5))
sns.boxplot(
x=target_col,
y='Curricular units 1st sem (enrolled)',
data=df,
palette=['#e74c3c', '#f39c12', '#2ecc71']
)
plt.title('Curricular units 1st sem (enrolled) by Target', fontsize=12, fontweight='bold')
plt.grid(alpha=0.3, axis='y')
plt.tight_layout()
plt.show()
```
Figure 5 shows the number of curricular units enrolled in during the first semester for the three outcomes (Dropout, Enrolled, Graduate). All three groups show similar box positions, with medians around 5-6 units. Dropouts and Enrolled students have nearly identical distributions, while Graduates have a slightly higher box position. All groups show numerous outliers, particularly on the upper end, with some students enrolling in 15-26 units.
Figure 5 suggests that the number of courses taken does little to separate the outcomes, since all groups show similar enrollment patterns. Many high outliers appear across all groups, suggesting that ambitious enrollment is common regardless of eventual outcome.
```{python}
#| label: fig-2nd-sem-enrolled
#| fig-cap: "Second semester enrollment by student outcome"
plt.figure(figsize=(8, 5))
sns.boxplot(
x=target_col,
y='Curricular units 2nd sem (enrolled)',
data=df,
palette=['#e74c3c', '#f39c12', '#2ecc71']
)
plt.title('Curricular units 2nd sem (enrolled) by Target', fontsize=12, fontweight='bold')
plt.grid(alpha=0.3, axis='y')
plt.tight_layout()
plt.show()
```
As shown in Figure 6, Dropout and Enrolled students have similar distributions with their boxes positioned in the lower range. Graduate students show a noticeably higher box position and a wider spread. All three groups display numerous outliers, particularly on the upper end.
As with the 1st semester enrollment patterns, the 2nd semester shows that graduates tend to enroll in slightly more courses, though the differences remain modest. The similar enrollment behavior of dropouts and enrolled students suggests that course load decisions in the 2nd semester do not strongly differentiate these groups: the key difference lies in completion rates rather than enrollment ambitions.
```{python}
#| label: fig-daytime-evening
#| fig-cap: "Daytime/evening attendance by student outcome"
tab = (pd.crosstab(df[target_col], df['Daytime/evening attendance'])
.apply(lambda r: r / r.sum(), axis=1))
tab = tab.reindex(columns=sorted(tab.columns.tolist()))
tab.plot(kind="bar", stacked=True,
color=['#e74c3c', '#2ecc71'], edgecolor='black')
plt.ylabel("Proportion within target group")
plt.title('Daytime/evening attendance by Target', fontsize=12, fontweight='bold')
plt.legend(title='Attendance', labels=['Evening', 'Daytime'],
bbox_to_anchor=(1.02, 1), loc="upper left")
plt.ylim(0, 1)
plt.grid(alpha=0.3, axis='y')
plt.xticks(rotation=0)
plt.tight_layout()
plt.show()
```
Figure 7 shows the proportion of daytime and evening attendance within the three groups (Dropout, Enrolled, Graduate). Daytime attendance dominates across all three groups, representing approximately 85-90% of students. However, Dropout students show a slightly higher proportion of evening attendance (around 15%) compared to Enrolled and Graduate students (around 10%).
This small difference might indicate that evening students face additional challenges, though the similarity across all groups suggests attendance timing is not a primary driver of dropout rates.
```{python}
#| label: fig-application-order
#| fig-cap: "Application order by student outcome"
plt.figure(figsize=(8, 5))
sns.boxplot(
x=target_col,
y='Application order',
data=df,
palette=['#e74c3c', '#f39c12', '#2ecc71']
)
plt.title('Application order by Target', fontsize=12, fontweight='bold')
plt.grid(alpha=0.3, axis='y')
plt.tight_layout()
plt.show()
```
Figure 8 shows the application order for the three outcome groups. All three groups show similar distributions positioned in the lower range, with medians of approximately 1.5-2. The upper whisker reaches 3 in all three groups, while the lower whisker is at 0 for Graduates and around 1 for Dropout and Enrolled. There are numerous outliers at 4, 5 and 6, and even 9 in the Enrolled category, indicating that for some students this was their 4th, 5th, 6th or even 9th choice.
Because the distribution is similar across the three categories, application order does not appear to have a strong relationship with student success. Most students applied to this institution as their first or second choice, suggesting that institutional preference does not really predict whether a student will drop out, stay enrolled or graduate. The similar outliers across groups support the view that application order is not a meaningful predictor of a student's outcome.
```{python}
#| label: fig-displaced
#| fig-cap: "Displaced status by student outcome"
tab = (pd.crosstab(df[target_col], df['Displaced'])
.apply(lambda r: r / r.sum(), axis=1))
tab = tab.reindex(columns=sorted(tab.columns.tolist()))
tab.plot(kind="bar", stacked=True,
color=['#e74c3c', '#2ecc71'], edgecolor='black')
plt.ylabel("Proportion within target group")
plt.title('Displaced by Target', fontsize=12, fontweight='bold')
plt.legend(title='Displaced', labels=['No', 'Yes'],
bbox_to_anchor=(1.02, 1), loc="upper left")
plt.ylim(0, 1)
plt.grid(alpha=0.3, axis='y')
plt.xticks(rotation=0)
plt.tight_layout()
plt.show()
```
Figure 9 shows the proportion of displaced students (those who moved or changed residence) across the three target groups. Dropout students have the highest proportion of non-displaced students at around 53%. Enrolled students show about 45% non-displaced. Graduate students have the lowest at approximately 40% non-displaced, meaning 60% of graduates relocated.
The pattern shows that students who relocated for their studies were more likely to graduate. This could be because moving demonstrates stronger commitment to education, or because staying home means dealing with work, family responsibilities, or other obligations that interfere with studying. Dropouts were the least likely to have relocated, suggesting that remaining in their original environment may have made it harder to focus on academics.
### 4.7.2 - Key Findings for Academic Performance and Study Conditions
Graduates have higher admission grades and previous qualification grades than dropouts, though the differences are relatively small. This suggests that prior academic preparation has limited predictive power.
First-semester grades are the strongest predictor of student outcome among the variables examined here. Students who drop out show dramatically lower grades (many between 0 and 5), while graduates consistently have higher grades (median around 13.5). First-semester performance is therefore a critical warning signal for identifying at-risk students.
Graduates tend to enroll in more courses in the first semester (median around 6-7) than students who drop out (median around 5-6). This may reflect stronger initial academic engagement, although the difference remains small.
Daytime/evening attendance shows a small but observable difference. Evening students make up around 15% of dropouts compared with around 10% of graduates, which may reflect the additional challenges faced by students who must balance work or family responsibilities with their studies. Given how similar the groups are overall, attendance timing is unlikely to be a primary driver of dropout.
Displaced students are more represented among graduates (around 60% of graduates vs around 47% of dropouts). This somewhat counter-intuitive pattern may indicate that relocating for studies reflects stronger commitment or independence.
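The findings above compare proportions by eye. One way to put a number on the strength of association between a categorical factor and the outcome is Cramér's V, computed from the uncorrected chi-square statistic of a contingency table. The sketch below is illustrative only: the helper `cramers_v` and the toy counts are assumptions, and it presumes that every row and column of the table has a positive total.

```python
import numpy as np
import pandas as pd


def cramers_v(table: pd.DataFrame) -> float:
    """Cramér's V (0 = no association, 1 = perfect association) for a contingency table."""
    observed = table.to_numpy(dtype=float)
    n = observed.sum()
    # Expected counts under independence: row total * column total / grand total
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    chi2 = ((observed - expected) ** 2 / expected).sum()
    k = min(observed.shape) - 1
    return float(np.sqrt(chi2 / (n * k)))


# Toy data (invented): 10 dropouts and 10 graduates with a binary "Displaced" flag
toy = pd.DataFrame({
    "Target": ["Dropout"] * 10 + ["Graduate"] * 10,
    "Displaced": [0] * 5 + [1] * 5 + [0] * 4 + [1] * 6,
})
table = pd.crosstab(toy["Target"], toy["Displaced"])
v_toy = cramers_v(table)
print(f"Cramér's V (toy data): {v_toy:.3f}")
```

On the real data, the same function could be applied to `pd.crosstab(df[target_col], df['Displaced'])` or to the attendance variable, which would help decide whether a visible gap in proportions is a weak or a strong association.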
### 4.7.3 - Demographic & Socioeconomic Background
```{python}
#| label: fig-age
#| fig-cap: "Age at enrollment by student outcome"
plt.figure(figsize=(8, 5))
sns.boxplot(
x=target_col,
y='Age at enrollment',
data=df,
palette=['#e74c3c', '#f39c12', '#2ecc71']
)
plt.title('Age at enrollment by Target', fontsize=12, fontweight='bold')
plt.grid(alpha=0.3, axis='y')
plt.tight_layout()
plt.show()
```
Figure 10 shows the relationship between age at enrollment and the three student outcomes. The Dropout group has the highest median age, approximately 23, and the widest interquartile range. The Enrolled group has a median age of around 20-21, while the Graduate group shows the lowest median age, around 19, and the narrowest interquartile range. All three groups contain numerous outliers, corresponding to older students from their late thirties up to 70 years old.
This suggests that age at enrollment is an important predictor of student outcome. Students who enroll at a younger age are more likely to graduate, while older students face a higher risk of dropping out. Several factors could explain this: younger students may have fewer external responsibilities, whereas older students may juggle multiple commitments that interfere with their studies. The wider distribution of the Dropout group shows that dropout can occur at any age. However, older students do successfully graduate, showing that age alone does not determine success.
```{python}
#| label: fig-gender
#| fig-cap: "Gender by student outcome"
tab = (pd.crosstab(df[target_col], df['Gender'])
.apply(lambda r: r / r.sum(), axis=1))
tab = tab.reindex(columns=sorted(tab.columns.tolist()))
tab.plot(kind="bar", stacked=True,
color=['#e74c3c', '#2ecc71'], edgecolor='black')
plt.ylabel("Proportion within target group")
plt.title('Gender by Target', fontsize=12, fontweight='bold')
plt.legend(title='Gender', labels=['Female', 'Male'],
bbox_to_anchor=(1.02, 1), loc="upper left")
plt.ylim(0, 1)
plt.grid(alpha=0.3, axis='y')
plt.xticks(rotation=0)
plt.tight_layout()
plt.show()
```
Figure 11 shows the proportion of genders within each of the three target groups (Dropout, Enrolled, Graduate). In the Dropout group, the proportion of male and female students is almost equal. In contrast, both the Enrolled and Graduate groups have a higher proportion of female students than male students.
This suggests that female students tend to persist and complete their studies at higher rates than male students. Male students appear slightly more likely to interrupt or drop out of their programs, which may contribute to the lower proportions of male students observed in the Enrolled and Graduate groups.
```{python}
#| label: fig-scholarship
#| fig-cap: "Scholarship holder status by student outcome"
#plt.figure(figsize=(8, 5))
tab = (pd.crosstab(df[target_col], df['Scholarship holder'])
.apply(lambda r: r / r.sum(), axis=1))
tab = tab.reindex(columns=sorted(tab.columns.tolist()))
tab.plot(kind="bar", stacked=True,
color=['#e74c3c', '#2ecc71'], edgecolor='black')
plt.ylabel("Proportion within target group")
plt.title('Scholarship holder by Target', fontsize=12, fontweight='bold')
plt.legend(title='Scholarship holder', labels=['No', 'Yes'],
bbox_to_anchor=(1.02, 1), loc="upper left")
plt.ylim(0, 1)
plt.grid(alpha=0.3, axis='y')
plt.xticks(rotation=0)
plt.tight_layout()
plt.show()
```
Figure 12 shows the proportion of scholarship holders within each target group (Dropout, Enrolled, Graduate).
In both the Dropout and Enrolled groups, the vast majority of students do not receive a scholarship, with only a small proportion being scholarship holders. In contrast, the Graduate group contains a noticeably higher proportion of scholarship recipients.
This suggests that students who receive a scholarship may be more likely to graduate than those who do not. Scholarships often reduce financial pressure and provide support that may help students remain enrolled and complete their studies. Conversely, students without scholarships seem more represented among dropouts and ongoing enrollments.
```{python}
#| label: fig-debtor
#| fig-cap: "Debtor status by student outcome"
tab = (pd.crosstab(df[target_col], df['Debtor'])
.apply(lambda r: r / r.sum(), axis=1))
tab = tab.reindex(columns=sorted(tab.columns.tolist()))
tab.plot(kind="bar", stacked=True,
color=['#e74c3c', '#2ecc71'], edgecolor='black')
plt.ylabel("Proportion within target group")
plt.title('Debtor by Target', fontsize=12, fontweight='bold')
plt.legend(title='Debtor', labels=['No', 'Yes'],
bbox_to_anchor=(1.02, 1), loc="upper left")
plt.ylim(0, 1)
plt.grid(alpha=0.3, axis='y')
plt.xticks(rotation=0)
plt.tight_layout()
plt.show()
```
Figure 13 shows that the Dropout group has the largest proportion of students who are debtors. Enrolled students still include some debtors, but the proportion is noticeably smaller. In the Graduate group, almost all students have no debt, with only a very small fraction appearing as debtors.
This trend suggests that having debt is more common among students who end up dropping out, hinting that financial pressure may contribute to early departure. Conversely, students without debt seem more likely to remain enrolled and reach graduation.
```{python}
#| label: fig-marital-status
#| fig-cap: "Marital status by student outcome"
plt.figure(figsize=(8, 5))
# Encode target categories as numbers
x_encoded = df[target_col].cat.codes
# Add jitter to avoid overlap
x_jitter = x_encoded + np.random.normal(0, 0.05, size=len(df))
y_jitter = df['Marital Status'] + np.random.normal(0, 0.05, size=len(df))
# Beautiful colormap
colors = plt.cm.viridis(x_encoded / x_encoded.max())
plt.scatter(
x_jitter,
y_jitter,
s=40,
alpha=0.75,
c=colors,
edgecolor="black",
linewidth=0.4
)
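# One tick per target category, labelled with the category name
categories = df[target_col].cat.categories
plt.xticks(range(len(categories)), categories)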
plt.xlabel("Target")
plt.ylabel("Marital Status")
plt.title("Marital Status by Target", fontsize=12, fontweight='bold')
plt.grid(alpha=0.3, axis='y')
plt.tight_layout()
plt.show()
```
@fig-marital-status shows the distribution of marital status across the three target groups using a jittered scatter plot. The points align in nearly identical horizontal bands for each category, indicating that the marital status profiles are essentially the same among Dropout, Enrolled, and Graduate students.
The plot clearly shows that the vast majority of students are single, which is expected given the typical age range of university students. The less common marital statuses (married, divorced, widower, facto union, and legally separated) appear only sporadically and are spread evenly across the three groups.
Overall, the plot suggests that marital status has no meaningful relationship with student outcomes. Since the three distributions look almost identical, marital status offers very limited predictive value for distinguishing at-risk students or for explaining academic success. This aligns with the ANOVA results, in which Marital Status ranked outside the top ten features by importance.
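Because a jittered scatter plot makes exact proportions hard to read, a row-normalized contingency table can complement it. The sketch below uses invented toy data and assumes the column names `Target` and `Marital Status`; `pd.crosstab(..., normalize="index")` gives the share of each marital status within each outcome group.

```python
import pandas as pd

# Toy data (invented values) for illustration only
toy = pd.DataFrame({
    "Target": ["Dropout"] * 4 + ["Enrolled"] * 4 + ["Graduate"] * 4,
    "Marital Status": ["single", "single", "single", "married",
                       "single", "single", "single", "single",
                       "single", "single", "single", "divorced"],
})

# Each row sums to 1: share of every marital status within one outcome group
shares = pd.crosstab(toy["Target"], toy["Marital Status"], normalize="index")
print(shares.round(2))
```

If the rows of this table are close to each other on the real data, that supports the conclusion that marital status carries little information about the outcome.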
```{python}
#| label: fig-mother-qualification
#| fig-cap: "Mother's qualification by student outcome"
plt.figure(figsize=(8, 5))
sns.boxplot(
x=target_col,
y="Mother's qualification",
data=df,
palette=['#e74c3c', '#f39c12', '#2ecc71']
)
plt.title("Mother's qualification by Target", fontsize=12, fontweight='bold')
plt.grid(alpha=0.3, axis='y')
plt.tight_layout()
plt.show()
```
Figure 15 shows the mother’s qualification across the three target groups. The three groups display nearly identical distributions, with all medians at approximately 19, and the Enrolled and Graduate distributions are indistinguishable. The Dropout group shows a slightly wider interquartile range, extending lower than in the other two groups. The whiskers reach similar limits and there are no outliers.
While the three distributions are very similar, the dropout category shows slightly fewer students with mothers who have very low qualifications, but this is minimal. The overall similarity indicates that a mother's qualification has little influence on whether students complete their studies.
### 4.7.4 - Key Findings for Demographic & Socioeconomic Background
Younger students are more likely to graduate while older students face higher dropout risk. The dropout group also shows the widest age variation. This suggests that older students may face competing life responsibilities that interfere with their studies.
Female students graduate at slightly higher rates: they make up a larger proportion of both the Enrolled and Graduate groups, whereas the Dropout group is split roughly 50-50 between genders.
Graduates have a noticeably higher proportion of scholarship recipients compared to dropouts and enrolled students, where the majority receive no scholarship. This suggests that financial support helps students complete their studies.
Dropouts have the largest proportion of students who are debtors. Enrolled students include some debtors, but fewer than the Dropout group. Almost no graduates are debtors. Financial pressure is strongly associated with dropout, while financial stability is associated with graduation.
For marital status, all three groups show nearly identical distributions. The less common marital statuses appear sporadically and equally across all categories. Marital status shows no meaningful relationship with student outcomes and offers very limited predictive value.
For mother’s qualification, the three groups display nearly identical distributions with medians around 19. While the Dropout group shows a slightly wider spread extending lower, the difference is minimal. Mother’s qualification has little to no influence on whether students complete their studies.
## 5 - Methods (Predictive Modelling)
In this section, we begin building a predictive model aimed at understanding which factors are most strongly associated with student dropout. Our goal is not only to classify students into the three outcome categories (Dropout, Enrolled, Graduate), but also to identify which variables contribute most to the risk of dropping out.
We train a Random Forest classifier as a first baseline model. This allows us to evaluate predictive performance and obtain a first indication of which features may be important. To further interpret and validate these results, we use LIME explanations, both at the individual level (example students) and globally across multiple samples.
This modelling part is therefore an exploratory step toward understanding dropout risk: the aim is to identify meaningful patterns, highlight influential academic or demographic factors, and evaluate which features could be most relevant for predicting student success or failure. Later, these insights can be refined and made more specific to dropout prediction.
```{python}
#| label: classification-model
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
import lime
import lime.lime_tabular
import matplotlib.pyplot as plt
import seaborn as sns
# Use the dataframe after feature selection (columns already removed)
# This df already has the 8 redundant columns removed
print(f"Starting with {df.shape[1]} features (after feature selection)")
# Create working dataframe - use all columns that exist
df_model = df.copy()
# Remove rows with missing values
print(f"Dataset shape before cleaning: {df_model.shape}")
df_model = df_model.dropna()
print(f"Dataset shape after removing missing values: {df_model.shape}")
print(f"\nTarget distribution:")
print(df_model['Target'].value_counts())
# Separate features and target
X = df_model.drop('Target', axis=1)
y = df_model['Target']
print(f"\nUsing {X.shape[1]} features for classification")
# Encode categorical variables
categorical_cols = X.select_dtypes(include=['object', 'category']).columns
label_encoders = {}
print(f"\n Encoding {len(categorical_cols)} categorical variables...")
for col in categorical_cols:
    le = LabelEncoder()
    X[col] = le.fit_transform(X[col].astype(str))
    label_encoders[col] = le
# Split data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
print(f"\nTraining set size: {X_train.shape[0]}")
print(f"Test set size: {X_test.shape[0]}")
# Train Random Forest model
print("\n Training Random Forest Classifier...")
rf_model = RandomForestClassifier(
n_estimators=100,
max_depth=10,
random_state=42,
n_jobs=-1
)
rf_model.fit(X_train, y_train)
# Make predictions
y_pred = rf_model.predict(X_test)
y_pred_proba = rf_model.predict_proba(X_test)
# Evaluate model
print("\n Model Performance:")
print(f"Accuracy: {accuracy_score(y_test, y_pred):.3f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
# Confusion Matrix
plt.figure(figsize=(8, 6))
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix')
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.tight_layout()
plt.show()
# Feature Importance from Random Forest
feature_importance = pd.DataFrame({
'feature': X.columns,
'importance': rf_model.feature_importances_
}).sort_values('importance', ascending=False)
print("\n Top 15 Most Important Features (Random Forest):")
print(feature_importance.head(15))
# Plot feature importance
plt.figure(figsize=(10, 8))
top_15 = feature_importance.head(15)
plt.barh(range(len(top_15)), top_15['importance'])
plt.yticks(range(len(top_15)), top_15['feature'])
plt.xlabel('Importance')
plt.title('Top 15 Feature Importance (Random Forest)')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()
# LIME Explainer
print("\n Setting up LIME explainer...")
explainer = lime.lime_tabular.LimeTabularExplainer(
training_data=X_train.values,
feature_names=X.columns.tolist(),
class_names=[str(c) for c in sorted(y.unique())],
mode='classification',
random_state=42
)
# Explain a few predictions
print("\n LIME Explanations for Sample Predictions:")
# Select 3 random test samples
np.random.seed(42)
sample_indices = np.random.choice(X_test.index, size=min(3, len(X_test)), replace=False)
for idx, sample_idx in enumerate(sample_indices):
    sample = X_test.loc[sample_idx].values
    true_label = y_test.loc[sample_idx]
    pred_label = rf_model.predict([sample])[0]
    pred_proba = rf_model.predict_proba([sample])[0]
    print(f"\n--- Sample {idx + 1} ---")
    print(f"True Label: {true_label}")
    print(f"Predicted Label: {pred_label}")
    print(f"Prediction Probability: {pred_proba}")
    # Generate LIME explanation
    # Note: explain_instance defaults to labels=(1,), so the weights below
    # refer to class index 1 in class_names, not to the predicted class
    exp = explainer.explain_instance(
        sample,
        rf_model.predict_proba,
        num_features=10
    )
    # Show explanation
    print("\nTop 10 features influencing this prediction:")
    for feature, weight in exp.as_list():
        print(f" {feature}: {weight:.3f}")
    # Plot explanation
    fig = exp.as_pyplot_figure()
    plt.tight_layout()
    plt.show()
# Global feature importance from LIME (sample-based)
print("\n Computing global LIME feature importance (sampling 100 instances)...")
# Sample instances for global importance
sample_size = min(100, len(X_test))
sample_indices_global = np.random.choice(len(X_test), size=sample_size, replace=False)
lime_weights = {feature: [] for feature in X.columns}
for i in sample_indices_global:
    exp = explainer.explain_instance(
        X_test.iloc[i].values,
        rf_model.predict_proba,
        num_features=len(X.columns)
    )
    for feature, weight in exp.as_list():
        # Extract feature name (LIME returns feature with value range)
        feature_name = feature.split('<=')[0].split('>')[0].split('=')[0].strip()
        # Find matching column (partial match)
        for col in X.columns:
            if col in feature_name or feature_name in col:
                lime_weights[col].append(abs(weight))
                break
# Compute average absolute weight for each feature
lime_importance = pd.DataFrame({
'feature': list(lime_weights.keys()),
'lime_importance': [np.mean(weights) if weights else 0 for weights in lime_weights.values()]
}).sort_values('lime_importance', ascending=False)
print("\n Top 15 Most Important Features (LIME Global):")
print(lime_importance.head(15))
# Compare Random Forest vs LIME importance
plt.figure(figsize=(12, 8))
comparison = feature_importance.merge(lime_importance, on='feature', how='left')
comparison['lime_importance'] = comparison['lime_importance'].fillna(0)
top_features = comparison.nlargest(15, 'importance')
x = np.arange(len(top_features))
width = 0.35
plt.barh(x - width/2, top_features['importance'], width, label='Random Forest', alpha=0.8)
plt.barh(x + width/2, top_features['lime_importance'], width, label='LIME', alpha=0.8)
plt.yticks(x, top_features['feature'])
plt.xlabel('Importance Score')
plt.title('Feature Importance Comparison: Random Forest vs LIME')
plt.legend()
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()
```
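The importances read from `feature_importances_` are impurity-based and computed on the training data, which can favour features with many distinct values. A common complementary check is permutation importance on the held-out set, available in scikit-learn as `sklearn.inspection.permutation_importance`. The sketch below is not part of the analysis above: it uses synthetic data with one informative feature and one pure-noise feature, and all variable names are made up for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: the label depends only on the first feature
rng = np.random.default_rng(42)
n_samples = 600
informative = rng.normal(size=n_samples)
noise = rng.normal(size=n_samples)
X_toy = np.column_stack([informative, noise])
y_toy = (informative > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42, stratify=y_toy
)

model = RandomForestClassifier(n_estimators=50, random_state=42)
model.fit(X_tr, y_tr)

# Shuffle each feature in turn on the test set and measure the drop in accuracy
result = permutation_importance(
    model, X_te, y_te, n_repeats=10, random_state=42
)

for name, mean, std in zip(["informative", "noise"],
                           result.importances_mean,
                           result.importances_std):
    print(f"{name}: {mean:.3f} +/- {std:.3f}")
```

In the project itself, the same call could be made with `rf_model`, `X_test` and `y_test`; features that rank high in both the impurity-based and the permutation-based lists would be the more trustworthy candidates for dropout risk factors.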
## Conclusion and Next Steps
So far in this project, we have completed the fundamental steps of a rigorous analysis. We described the research background, presented relevant research questions and prepared the dataset for analysis.
We collected and examined the dataset from the UCI Machine Learning Repository and verified its reliability through systematic preprocessing. During this stage, we checked for missing values, identified anomalies (outliers), translated categorical variables into a readable format for visualization, and removed irrelevant or redundant variables. Through EDA, we analyzed academic, demographic and socioeconomic factors, and identified initial patterns associated with the “Target” variable (Dropout, Enrolled, Graduate). By exploring key insights such as first-semester performance, age at enrollment, scholarship status, and financial stability, we established the preliminary steps for the predictive modeling phase that will be built in the following weeks.
During the upcoming weeks (November 18 to December 7, Weeks 5 to 7), we will focus on building and evaluating the predictive models. We will have to test and determine a suitable modeling technique, such as random forests or decision trees. Our objective is also to assess model performance, using appropriate metrics. We also have to identify the most important variables that contribute to prediction accuracy, and determine which modeling approach best predicts student outcomes related to the research questions.
From December 8 to December 14 (Week 8), we will analyze and interpret the results of the predictive model to determine the insights that help answer our research questions. After evaluating the relevant variables and their effects, we will link the modeling outcomes to the patterns observed during the EDA. This will allow us to provide clearer answers to our research questions.
In the final week (December 15, Week 9), we will dedicate our time to preparing the video presentation and finalizing the written report in Quarto, ensuring coherent and clear findings aligned with our overall analysis.